The Missing Link: Data Analysis with Missing Information
Abstract
How do you handle missing data? Deleting subjects with missing values frequently leads to biased outcomes. Mean imputation assumes that non-responders are no different from responders, and can bias variances toward zero. Last-observation-carried-forward methods, while still often used, can cause bias and even induce an apparent treatment effect. Multiple imputation is an improved method for dealing with these issues. This paper focuses on the Markov chain Monte Carlo (MCMC) based method of multiple imputation using SAS's PROC MI and PROC MIANALYZE.

INTRODUCTION
One of the leading concerns in data analysis is how to appropriately incorporate missing information. This paper discusses different types and causes of missing data, reviews case deletion and single imputation methods, and discusses the use of multiple imputation, with a focus on the MCMC-based method for arbitrary missing data patterns. While single imputation may not have a significant effect on subsequent analyses when only small amounts of information are missing, it frequently adds bias and distorts the relationships between variables. In contrast, multiple imputation computes statistics from several different imputed datasets, which preserves the relationships between variables and provides a measure of the uncertainty in the estimates. The attached appendix provides annotated SAS code for a variety of relationships between variables, using PROC MI and PROC MIANALYZE.

CAUSES AND CLASSIFICATIONS OF MISSING INFORMATION
Missing information can occur in data for a variety of reasons. In clinical trials, information is typically collected at scheduled visits. A subject in a clinical trial may not complete all items on a questionnaire, or may miss an entire visit, resulting in no data collection at that time point. Subjects may miss visits for reasons unrelated to the study, such as transportation difficulties, or for reasons potentially related to study medications, such as experiencing an adverse event.
Similarly, item-level missing information can occur for both study-related and unrelated reasons. It is not uncommon for these types of missing data to be followed by observed data at subsequent time points. Patterns in which missing values may be followed by observed data are considered arbitrary patterns of missing data. On the other hand, if a subject withdraws from the study or dies, all future scheduled observations will be missing. In other words, once an observation is missing, all subsequent observations are also missing. This is called a monotone pattern of missing data, and it allows for more flexibility in analysis choices than arbitrary patterns.

MISSINGNESS MECHANISMS
We shall refer to the matrix of complete data as $Y$, which has $n$ rows (subjects) and $p$ columns (variables). $Y$ can be separated into two parts: $Y_{\mathrm{obs}}$, the observed data, and $Y_{\mathrm{mis}}$, the missing data. We can also create a matrix $R$ of response indicators, whose elements $r_{ij}$ are 0 if $y_{ij}$ is missing and 1 if it is observed.

The simplest case is missing completely at random (MCAR), in which missingness occurs purely at random. In other words, subjects with missing data are like a random subsample of all subjects; there is no difference between responders and non-responders. For instance, if every subject were equally likely to record his or her weight, the missing weight data would be missing completely at random. In technical terms, MCAR indicates that the probability of missingness is independent of the data, both observed and unobserved:

$$P(R \mid Y) = P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R)$$

Unfortunately, the MCAR assumption is rarely realized. A less restrictive and more realistic scenario is that the data are missing at random (MAR). This assumption states that the probability of missingness depends only upon observed variables. For example, if women were less likely to record their weight, and gender was recorded for every subject, the missing weight data would be MAR.
This assumption becomes more plausible as more variables are recorded. It is possible to test the MCAR assumption against the MAR alternative, although multiple imputation methods can be used on MCAR data as well as MAR data. Since the missingness is independent of the unobserved data once the observed data are taken into account, the missing at random assumption can also be written as

$$P(R \mid Y) = P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}})$$
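The difference between the two mechanisms can be made concrete with a small simulation of the weight and gender example. The sketch below is illustrative only: it uses Python with NumPy rather than the SAS procedures this paper focuses on, and the population values (mean weights of 65 and 80 kg, a standard deviation of 10 kg) and the missingness rates (30% under MCAR; 50% for women and 10% for men under MAR) are assumptions chosen for the demonstration, not figures from the paper. It compares the complete-case mean with the full-data mean under each mechanism, and checks how mean imputation changes the variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000                     # hypothetical number of subjects

# Hypothetical population: gender is always observed, weight may be missing.
female = rng.random(n) < 0.5
weight = np.where(female,
                  rng.normal(65.0, 10.0, n),   # assumed weights for women (kg)
                  rng.normal(80.0, 10.0, n))   # assumed weights for men (kg)

# MCAR: every subject has the same 30% chance of a missing weight,
# so the response indicator is independent of all data.
r_mcar = rng.random(n) >= 0.30                 # True = weight observed

# MAR: the chance of a missing weight depends only on the observed gender
# (assumed 50% missing for women, 10% missing for men).
p_missing = np.where(female, 0.50, 0.10)
r_mar = rng.random(n) >= p_missing             # True = weight observed

true_mean = weight.mean()
cc_mean_mcar = weight[r_mcar].mean()           # complete-case mean under MCAR
cc_mean_mar = weight[r_mar].mean()             # complete-case mean under MAR

# Mean imputation under MCAR: replace each missing weight with the observed mean.
imputed = np.where(r_mcar, weight, cc_mean_mcar)
var_true = weight.var()
var_imputed = imputed.var()

print(f"full-data mean:          {true_mean:.2f}")
print(f"complete-case mean MCAR: {cc_mean_mcar:.2f}")
print(f"complete-case mean MAR:  {cc_mean_mar:.2f}")
print(f"variance, full data:     {var_true:.1f}")
print(f"variance, mean-imputed:  {var_imputed:.1f}")
```

Under these assumptions the complete-case mean should stay close to the full-data mean when the data are MCAR, and drift upward when they are MAR, because women, who are lighter on average in this simulated population, are under-represented among the responders. The mean-imputed variance should be noticeably smaller than the full-data variance, which is the shrinkage toward zero described in the abstract.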